🐛 Add checks to catch race-driven panics in metrics endpoint e2e #1795
Conversation
Tries to fix #1781
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #1795      +/-   ##
==========================================
+ Coverage   68.34%   68.43%   +0.09%
==========================================
  Files          63       62       -1
  Lines        5117     5117
==========================================
+ Hits         3497     3502       +5
+ Misses       1390     1387       -3
+ Partials      230      228       -2
Are there examples in the flakes for each of these assumptions not holding? If so, I think we should look at each one to understand what's happening and then decide what the correct fix is.
Hi @bentito,
Terrific work! 🚀 This is looking great.
Just a couple of minor suggestions:
- The PR description and commit history seem to be a bit out of sync with the latest changes. It would be great to update the description accordingly and squash the commits to maintain a clean history.
- The 30s timeout might be a bit short, especially considering that the --force option is used here: #1795 (comment). Since the standard timeout across tests is 1 minute, it might be better to align with that for consistency.
Other than that, everything looks good to me! 👍
Thanks for your work on this. 🎉
// Wait for the ClusterRoleBinding to be deleted.
if err := waitForDeletion(ctx, c.client, "clusterrolebinding", c.clusterBinding); err != nil {
	c.t.Logf("Error waiting for clusterrolebinding deletion: %v", err)
} else {
	c.t.Log("ClusterRoleBinding deleted")
}

// Wait for the Pod to be deleted.
if err := waitForDeletion(ctx, c.client, "pod", c.curlPodName, "-n", c.namespace); err != nil {
	c.t.Logf("Error waiting for pod deletion: %v", err)
} else {
	c.t.Log("Pod deleted")
}
Do we really need to even wait for these to be deleted?
+1, good point.
But I am betting that is the fix.
In the scenarios where it fails because the Pod gets stuck, we are now just logging instead of failing.
to be clear: I have a pretty quick test (takes about 50m): run the e2e suite 10 times. When run on main, this test always fails on my local box: at least one failure out of 10, usually 3-5. With the changes in this PR, that same test passes. I think @camilamacedo86 is right that the change in behavior is key to the fix: we log instead of fail. I guess I could try not waiting, but I'd rather move on, since something is fixing the flake.
There’s potential for flakiness due to timing issues. For example, the test assumes that:
• The ClusterRoleBinding and token are immediately effective.
• The curl pod becomes ready promptly.
• The metrics endpoint is available as soon as the pod is ready.
It seems these can all be covered by retry logic in the validate method.

Signed-off-by: Brett Tofel <[email protected]>
(force-pushed from 1e0ac4c to 493315f)
/lgtm
/hold for @joelanford @camilamacedo86 or @azych to review
Hi @bentito,
This might not fully address the root cause (why was it not possible to delete?). The fix mainly works because we are logging instead of failing, as discussed in this comment.
That said, I don't see much value in spending more effort on it at this point. You did a great job addressing the intermittent scenario, and we can always revisit this if needed. 🥇 I don't see why we shouldn't move forward with this approach. If further tests show that the Pod/ClusterRoleBinding isn't being cleaned up properly for some reason other than a timeout, or if we hit intermittent issues again as we add more tests, then we may need to look further. For now, it looks great to me! 👍
/lgtm
/approve
I share the sentiment - I'm still unsure about the root cause, but if this works and other people are good with it, I'm OK too. /lgtm
Description
There’s potential for flakiness due to timing issues. For example, the TestOperatorControllerMetricsExportedEndpoint assumes that:
• The ClusterRoleBinding and token are immediately effective.
• The curl pod becomes ready promptly.
• The metrics endpoint is available as soon as the pod is ready.
In slower or variable environments, these assumptions might not hold, leading to intermittent failures. Adding explicit waits or retry logic for each resource’s readiness could help mitigate race conditions.
To fix this, this PR:
• Adds a generic retry helper that attempts a function up to 5 times with a 5-second delay between tries (a minimal sketch follows this list).
• Wraps key steps in the metrics endpoint tests with the retry logic:
• Creating the ClusterRoleBinding.
• Generating the ServiceAccount token.
• Creating the curl pod.
• Validating the metrics endpoint via curl.
• Refactors these steps to return errors instead of calling require.NoError immediately, so transient issues are retried before ultimately failing.
• This change minimizes test flakiness due to race conditions or temporary resource unavailability.
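For illustration, here is a minimal sketch of this kind of retry helper, assuming the 5-attempt / 5-second-delay policy described above; the name retry and the exact signature are placeholders, not necessarily the identifiers used in the PR:

import (
	"fmt"
	"time"
)

// retry runs fn up to attempts times, sleeping delay between failed tries,
// and returns the last error if every attempt fails.
func retry(attempts int, delay time.Duration, fn func() error) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = fn(); lastErr == nil {
			return nil
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("after %d attempts: %w", attempts, lastErr)
}

A step such as creating the curl pod would then be wrapped roughly as err := retry(5, 5*time.Second, func() error { return createCurlPod(ctx) }) followed by a single require.NoError(t, err), where createCurlPod stands in for the actual step function; transient failures are retried before the test ultimately fails.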
Currently on main, on my local machine, I ran this script:
Failed on the 2nd try on main:
I'm running the script now on this branch and will update with my findings.
I gotta make the script report more detail, to make sure this e2e is actually what's causing the failure, but it looks like no joy :-(
Run of the script on this branch:
Reviewer Checklist